Item Response Theory (IRT)
ZAHRA FAYAZ

IRT Introduction:

Item response theory (IRT) provides a model-based linkage between item responses and the latent characteristic assessed by a test or scale. Also known as latent trait theory, strong true score theory, or modern mental test theory, it is a paradigm for the design, analysis, and scoring of tests, questionnaires, and similar instruments measuring abilities, attitudes, or other variables. It is based on the application of related mathematical models to testing data. Because it is generally regarded as superior to classical test theory, it is the preferred method for developing scales, especially when optimal decisions are demanded, as in so-called high-stakes tests, e.g., the Graduate Record Examination (GRE) and the Graduate Management Admission Test (GMAT).

Unlike simpler alternatives for creating scales, such as simply summing questionnaire responses, IRT does not assume that each item is equally difficult. This distinguishes IRT from, for instance, the assumption in Likert scaling that "All items are assumed to be replications of each other or in other words items are considered to be parallel instruments" [1] (p. 197). By contrast, item response theory treats the difficulty of each item (the ICCs) as information to be incorporated in scaling items.

IRT is a model-based approach to psychological measurement whose origins can be traced to seminal articles by Lawley (1943, 1944) and Tucker (1946). IRT begins with the proposition that an individual's response to a specific test item or question is determined by an unobserved mental attribute of the individual. Each of these underlying attributes, most often referred to as latent traits or abilities, is assumed to vary continuously along a single dimension, usually denoted Θ. The position of person i on Θ is usually referred to as the person's ability or proficiency. The position of item j on Θ is termed the item's difficulty.
The concept most fundamental to IRT is the linkage between item responses and the characteristic measured by the scale or test. Specifically, the probability of a positive response to an item is assumed to be a function of Θ, the symbol used to denote the characteristic assessed by the scale or test. The idea underlying this function is that individuals with higher values of the attribute measured by the scale or test (for example, individuals with higher job satisfaction or greater mathematics achievement) should have higher probabilities of positive responses than individuals with lower values. Linking the probabilities of item responses to the characteristic assessed by the scale or test is what differentiates IRT from classical test theory.

IRT can be used to assess a test's measurement accuracy when individual ''items'' are added or deleted. Thus the unit of analysis for IRT is the individual item, while for classical test theory it is the complete test, because in classical test theory true scores, error scores, and parallel tests are usually defined in terms of complete tests. Analyzing scales and tests at the level of individual items allows IRT to address problems beyond the scope of classical test theory. To use IRT, investigators must first estimate the parameters that characterize items and persons.

IRT methods can be used to:
1) Determine how measurement accuracy varies across ability levels for a given test.
2) Construct a test with nearly constant measurement accuracy across a broad range of ability levels.

An important class of questions that can be addressed by methods based on IRT involves the meanings of the different categories in the response scales used by respondents. Researchers and practitioners must often make decisions about whether to include neutral points in the response scales.

IRT models fall into several types:
1) Models that assume the test or scale measures a single latent trait (i.e., unidimensional models) using items with just two response categories.
2) Unidimensional models with several ordered response categories.
3) Unidimensional models with several response categories that are not ordered, as in multiple-choice items.
4) Models for scales and tests that assess more than one latent trait.

The hope is that this material will help researchers make thoughtful choices as they select the model for their specific application to measurement problems in industrial and organizational psychology.

The name item response theory is due to the focus of the theory on the item, as opposed to the test-level focus of classical test theory. Thus IRT models the response of each examinee of a given ability to each item in the test. The term item is generic, covering all kinds of informative item. Items might be multiple-choice questions that have incorrect and correct responses, but they are also commonly statements on questionnaires that allow respondents to indicate a level of agreement (a rating or Likert scale), patient symptoms scored as present/absent, or diagnostic information in complex systems.

IRT is based on the idea that the probability of a correct or keyed response to an item is a mathematical function of person and item parameters. The person parameter is construed as (usually) a single latent trait or dimension. Examples include general intelligence or the strength of an attitude. Parameters on which items are characterized include their difficulty (known as "location" for their location on the difficulty range); their discrimination (slope or correlation), representing how steeply the rate of success of individuals varies with their ability; and a guessing parameter, the (lower) asymptote at which even the least able persons will score due to guessing (for instance, 25% for pure chance on a multiple-choice item with four options).

The concept of the item response function was around before 1950.
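The item parameters described above (location, discrimination, and lower asymptote) can be illustrated with a small sketch. The text does not name a specific functional form, so the three-parameter logistic (3PL) model is assumed here as one common choice; the function name `irf_3pl` and the example parameter values are hypothetical.

```python
import math


def irf_3pl(theta, a=1.0, b=0.0, c=0.0):
    """Probability of a keyed response under an assumed 3PL model.

    theta: person's position on the latent trait
    a: discrimination (slope), b: difficulty (location),
    c: lower asymptote (guessing), e.g. 0.25 for four answer options
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


if __name__ == "__main__":
    # Hypothetical item: moderately discriminating, average difficulty,
    # four answer options (chance level 0.25).
    for theta in (-3.0, -1.0, 0.0, 1.0, 3.0):
        p = irf_3pl(theta, a=1.2, b=0.0, c=0.25)
        print(f"theta={theta:+.1f}  P(keyed response)={p:.3f}")
```

Setting `c=0` gives a two-parameter form, and additionally fixing `a` to a common value for all items gives a one-parameter form. When the person's trait level equals the item's difficulty, the probability is halfway between the lower asymptote and 1.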
The pioneering work of IRT as a theory occurred during the 1950s and 1960s. Three of the pioneers were the Educational Testing Service psychometrician Frederic M. Lord [2], the Danish mathematician Georg Rasch, and the Austrian sociologist Paul Lazarsfeld, who pursued parallel research independently. Key figures who furthered the progress of IRT include Benjamin Drake Wright and David Andrich. IRT did not become widely used until the late 1970s and 1980s, when personal computers gave many researchers access to the computing power necessary for IRT.

Among other things, the purpose of IRT is to provide a framework for evaluating how well assessments work, and how well individual items on assessments work. The most common application of IRT is in education, where psychometricians use it for developing and refining exams, maintaining banks of items for exams, and equating the difficulties of successive versions of exams (for example, to allow comparisons between results over time) [3].

IRT models are often referred to as latent trait models. The term latent is used to emphasize that discrete item responses are taken to be observable manifestations of hypothesized traits, constructs, or attributes that are not directly observed but must be inferred from the manifest responses. Latent trait models were developed in the field of sociology, but are virtually identical to IRT models.

IRT is generally regarded as an improvement over classical test theory (CTT). For tasks that can be accomplished using CTT, IRT generally brings greater flexibility and provides more sophisticated information. Some applications, such as computerized adaptive testing, are enabled by IRT and cannot reasonably be performed using only classical test theory. Another advantage of IRT over CTT is that the more sophisticated information IRT provides allows a researcher to improve the reliability of an assessment.
IRT entails three assumptions:
1) ''A unidimensional trait denoted by Θ;''
2) ''Local independence of items;''
3) ''The response of a person to an item can be modeled by a mathematical item response function (IRF).''

The trait is further assumed to be measurable on a scale (the mere existence of a test assumes this), typically set to a standard scale with a mean of 0.0 and a standard deviation of 1.0. 'Local independence' means that items are not related except for the fact that they measure the same trait. This is equivalent to the assumption of unidimensionality, but it is presented separately because multidimensionality can be caused by other issues. The topic of dimensionality is often investigated with factor analysis, while the IRF is the basic building block of IRT and is the center of much of the research and literature.

Item Response Theory (aka IRT) is also sometimes called latent trait theory. This is a modern test theory (as opposed to classical test theory). It is not the only modern test theory, but it is the most popular one and is currently an area of active research. IRT requires stronger assumptions than classical test theory (we will cover these in a moment). IRT is a much more intuitive approach to measurement once you get used to it. In IRT, the true score is defined on the latent trait of interest rather than on the test, as is the case in classical test theory. IRT is popular because it provides a theoretical justification for doing lots of things that classical test theory does not.

Some applications where IRT is handy include:
1) ''Item bias analysis:'' IRT provides a test of item equivalence across groups. We can test whether an item is behaving differently for Black and White examinees or for male and female examinees, for example. The same logic can be applied to translations of attitude scales into different languages. We can test whether the item means the same thing in English and French, for example.
2) ''Equating:'' Sometimes we have scores on one test and we would like to know what the equivalent score would be on another test (e.g., versions or forms of the SAT). IRT provides a theoretical justification for equating scores from one test to another.
3) ''Tailored testing:'' IRT provides an estimate of the true score that is not based on the number of correct items. This frees us to give different people different test items but still place people on the same scale. One particularly exciting feature of tailored testing is the capability to give people test items that are matched (close) to their ability level. A tailored testing program for the SAT will give more difficult items to more able test takers. This also has implications for test security, because different people get different tests.

Basics of IRT

Assumptions:
1. A single common factor accounts for all item covariances. This common factor is the latent trait of interest. This is stated a couple of different ways in the literature:
a) unidimensionality: there is a single latent trait;
b) local independence: if you partial out the test common factor from any two items, their residual covariance is zero.
This assumption is never met precisely. It is obviously a problem when the test format includes several items that are related by a common problem. For example, several different items may be asked about the same story. Monte Carlo work and experience with IRT programs suggest that minor violations of this assumption don't make much difference. The programs appear to work well so long as there is a clear dominant first factor in the data.
2. Relations between the latent trait and observed response have a specific form. The line relating the trait and response is called an item characteristic curve, or ICC for short (this is not the same ICC as the intraclass correlation coefficient).
It is theoretically possible to have several different kinds of relations between the trait and observed response, and there is a history of test theories that correspond to different relations.

Advantages of IRT:
1) More powerful test assembly with the test information function (TIF) and the conditional standard error of measurement (CSEM), including parallel form construction.
2) Better description of item performance (difficulty, discrimination, and guessing) and model fit.
3) More precise scoring.
4) Examinees and items are placed on the same scale.
5) Examinee scores are independent of test difficulty and the set of items used.
6) Item statistics are independent of the examinee sample.
7) Enables computerized adaptive testing (CAT) to dramatically reduce test length and improve score precision.
8) Provides an estimate of each examinee's score precision, based on their responses.